Stemming Approaches for East European Languages

Authors

  • Ljiljana Dolamic
  • Jacques Savoy
Abstract

During this CLEF evaluation campaign, our first objective is to propose and evaluate various indexing and search strategies for the Czech language that will hopefully result in more effective retrieval than language-independent approaches (n-gram). Based on the stemming strategy we developed for other languages, we propose for this Slavic language a light stemmer (inflectional only) and a second one based on a more aggressive suffix-stripping scheme that also removes some derivational suffixes. Our second objective is to undertake further study of the relative merit of various search engines when exploring Hungarian and Bulgarian documents. To evaluate these solutions we use various effective IR models. Our experiments generally show that for the Bulgarian language, removing certain frequently used derivational suffixes may improve mean average precision (MAP). For the Hungarian corpus, applying an automatic decompounding procedure improves the MAP. For the Czech language, a comparison of a light stemmer and a more aggressive one that removes both inflectional and some derivational suffixes reveals only small performance differences. For this language only, performance differences between a word-based and a 4-gram indexing strategy are also rather small.
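The abstract refers to two standard notions: character n-gram indexing as a language-independent alternative to stemming, and mean average precision as the effectiveness measure. The following is a minimal Python sketch of both, not the authors' implementation. The function names are hypothetical, and the choice to keep words shorter than n as a single token is an assumption, since the abstract does not say how short words are handled.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Return the overlapping character n-grams of one word.

    Assumption: a word of length <= n is kept whole as a single token.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list of document ids.

    Precision is accumulated at each rank where a relevant document
    appears, then divided by the total number of relevant documents.
    The ranking is assumed to contain no duplicate ids.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Iterable[Tuple[Sequence[str], Set[str]]]
) -> float:
    """Mean of the per-query average precision over (ranking, relevant) pairs."""
    scores = [average_precision(ranking, relevant) for ranking, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0
```

With n = 4, a word such as "stemming" is indexed as the five tokens "stem", "temm", "emmi", "mmin" and "ming", so morphological variants sharing a root share some index terms without any language-specific rules.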


Similar resources

Stemming Strategies for European Languages

In this paper, we describe and evaluate different general stemming approaches for the French, Portuguese (Brazilian), German and Hungarian languages. Based on the CLEF test-collections, we demonstrate that light stemming approaches are quite effective for the French, Portuguese and Hungarian languages, and perform reasonably well for the German language. Variations in mean average precision amo...


Monolingual Document Retrieval: English versus other European Languages

The vast majority of research in information retrieval is done using English collections and topics. This raises questions about the effectiveness of retrieval strategies for other languages. To examine this issue, we focus on document retrieval in nine European languages. In particular, we investigate the effectiveness of language-dependent approaches to document retrieval, such as stemming an...


Language-Dependent and Language-Independent Approaches to Cross-Lingual Text Retrieval

We investigate the effectiveness of language-dependent approaches to document retrieval, such as stemming and decompounding, and contrast them with language-independent approaches, such as character n-gramming. In order to reap the benefits of more than one type of approach, we also consider the effectiveness of the combination of both types of approaches. We focus on document retrieval in ni...


Lexical and Algorithmic Stemming Compared for 9 European Languages with Hummingbird SearchServer™ at CLEF 2003

Hummingbird participated in the monolingual information retrieval tasks of the Cross-Language Evaluation Forum (CLEF) 2003: for natural language queries in 9 European languages (German, French, Italian, Spanish, Dutch, Finnish, Swedish, Russian and English) find all the relevant documents (with high precision) in the CLEF 2003 document sets. For each language, SearchServer scored higher than th...


Data Fusion for Effective European Monolingual Information Retrieval

For our fourth participation in the CLEF evaluation campaigns, our first objective was to propose an effective and general stopword list and a light stemming procedure for the Portuguese language. Our second objective was to obtain a better picture of the relative merit of various search engines when processing documents in the Finnish and Russian languages. Finally, based on the Z-score method...




Publication date: 2007